AITopics | linear regression problem

How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Neural Information Processing Systems, Mar-22-2026, 16:04:43 GMT

Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers by designing specific parameter constructions, lacking a comprehensive understanding of their inherent working mechanisms post-training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of multi-heads exhibits different patterns across layers: multiple heads are utilized and essential in the first layer, while usually only a single head is sufficient for subsequent layers. We provide a theoretical explanation for this observation: the first layer preprocesses the context data, and the following layers execute simple optimization steps based on the preprocessed context. Moreover, we demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms. Further experimental results support our explanations. Our findings offer insights into the benefits of multi-head attention and contribute to understanding the more intricate mechanisms hidden within trained transformers.

artificial intelligence, machine learning, transformer utilize multi-head attention, (11 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

f8580959e35cb0934479bb007fb241c2-Paper.pdf

Neural Information Processing Systems, Feb-11-2026, 23:57:34 GMT

algorithm, privacy, representation, (13 more...)

Neural Information Processing Systems

Country:

North America > Canada > British Columbia > Vancouver (0.04)
Asia > Middle East > Jordan (0.04)
Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)

A Data and Code Availability

Neural Information Processing Systems, Feb-9-2026, 15:36:11 GMT

The implementations of the experiments on the ABC and FTDC datasets are similar. For the stability analysis, we are interested in the norm of term 1. In Section E.1, we briefly discuss the motivation behind studying age prediction and PCA-based statistical analysis in this context. In Section E.2, we provide additional details on cortical thickness data acquisition. In Section E.3, we report the results for stability analysis of VNNs and PCA-regression models for FTDC100 (... In Section E.4, we study the stability of VNNs on two simulated ... In Section E.5, we include additional figures ... A promising application of brain age prediction is early detection of neurodegenerative diseases (such as Alzheimer's or Huntington's disease), which may manifest themselves as errors in age prediction in pathological contexts by machine learning models trained ... E.4 Stability of VNNs on Synthetic Data: We consider two settings for synthetic data.

artificial intelligence, dataset, machine learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States > Pennsylvania (0.04)

Industry: Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.36)

1ae6464c6b5d51b363d7d96f97132c75-AuthorFeedback.pdf

Neural Information Processing Systems, Oct-2-2025, 08:03:16 GMT

algorithm, artificial intelligence, machine learning, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.52)

Linear Regression in p-adic metric spaces

Baker, Gregory D., McCallum, Scott, Pattinson, Dirk

arXiv.org Artificial Intelligence, Oct-2-2025

Many real-world machine learning problems involve inherently hierarchical data, yet traditional approaches rely on Euclidean metrics that fail to capture the discrete, branching nature of hierarchical relationships. We present a theoretical foundation for machine learning in p-adic metric spaces, which naturally respect hierarchical structure. Our main result proves that an n-dimensional plane minimizing the p-adic sum of distances to points in a dataset must pass through at least n + 1 of those points -- a striking contrast to Euclidean regression that highlights how p-adic metrics better align with the discrete nature of hierarchical data. As a corollary, a polynomial of degree n constructed to minimise the p-adic sum of residuals will pass through at least n + 1 points. As a further corollary, a polynomial of degree n approximating a higher degree polynomial at a finite number of points will yield a difference polynomial that has distinct rational roots. We demonstrate the practical significance of this result through two applications in natural language processing: analyzing hierarchical taxonomies and modeling grammatical morphology. These results suggest that p-adic metrics may be fundamental to properly handling hierarchical data structures in machine learning. In hierarchical data, interpolation between points often makes less sense than selecting actual observed points as representatives.
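To make the p-adic distance used in this abstract concrete, here is a minimal sketch that computes the p-adic absolute value of a rational number and the p-adic sum of residuals of a straight line. It is not taken from the paper: the function names `padic_valuation`, `padic_abs`, and `padic_line_loss`, and the three sample points, are illustrative assumptions.

```python
from fractions import Fraction


def padic_valuation(x, p):
    """Exponent of the prime p in the factorisation of a nonzero rational x."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("the valuation of 0 is +infinity")
    num, den, v = x.numerator, x.denominator, 0
    while num % p == 0:   # powers of p in the numerator raise the valuation
        num //= p
        v += 1
    while den % p == 0:   # powers of p in the denominator lower it
        den //= p
        v -= 1
    return v


def padic_abs(x, p):
    """p-adic absolute value |x|_p = p ** (-valuation), with |0|_p = 0."""
    x = Fraction(x)
    if x == 0:
        return Fraction(0)
    return Fraction(p) ** (-padic_valuation(x, p))


def padic_line_loss(points, slope, intercept, p):
    """Sum of p-adic sizes of the residuals y - (slope * x + intercept)."""
    return sum(
        padic_abs(Fraction(y) - (slope * Fraction(x) + intercept), p)
        for x, y in points
    )


# Hypothetical data: three points on y = x**2, scored against the line y = x.
points = [(0, 0), (1, 1), (2, 4)]
loss = padic_line_loss(points, 1, 0, 2)
print(loss)
```

In this toy example the line y = x passes through two of the three points, and the one remaining residual, 2, has 2-adic size 1/2. Numbers divisible by high powers of p count as small, which is what makes the metric sensitive to hierarchical (tree-like) structure.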

artificial intelligence, machine learning, polynomial, (15 more...)

arXiv.org Artificial Intelligence

2510.00043

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Supervised Learning > Representation Of Examples (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.53)

Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization

Wu, Jingfeng, Bartlett, Peter L., Lee, Jason D., Kakade, Sham M., Yu, Bin

arXiv.org Machine Learning, Sep-23-2025

Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal, while both ridge regression and online stochastic gradient descent (SGD) are polynomially suboptimal for certain categories of such problems. Moving beyond minimax theory, this work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem. Our analysis yields three key findings. First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems -- those with fast and continuously decaying covariance spectra -- which includes all problems satisfying the standard capacity condition.
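The phrase "comparable regularization" can be illustrated with the textbook spectral-filter view of both methods. This is only a sketch, not the paper's instance-wise risk analysis. It assumes gradient descent starts from zero on the least-squares loss with a step size no larger than the inverse of the top eigenvalue, and that ridge uses penalty `lam = 1 / (step_size * num_steps)`. Along an eigen-direction of the empirical covariance with eigenvalue `s`, each method keeps a fraction of the ordinary least squares coefficient. The names `gd_filter` and `ridge_filter` and the eigenvalues are illustrative assumptions.

```python
def gd_filter(s, step_size, num_steps):
    """Fraction of the OLS coefficient recovered by GD after num_steps steps
    (zero initialisation) along an eigen-direction with eigenvalue s."""
    return 1.0 - (1.0 - step_size * s) ** num_steps


def ridge_filter(s, lam):
    """Fraction of the OLS coefficient kept by ridge regression with penalty lam."""
    return s / (s + lam)


step_size, num_steps = 0.5, 20
lam = 1.0 / (step_size * num_steps)  # assumed matching: lam = 1 / (step_size * num_steps)

# Hypothetical decaying covariance spectrum.
eigenvalues = [1.0, 0.3, 0.1, 0.03, 0.01]
for s in eigenvalues:
    print(f"s={s:<5} gd={gd_filter(s, step_size, num_steps):.4f} "
          f"ridge={ridge_filter(s, lam):.4f}")
```

Both filters pass large-eigenvalue directions almost unchanged and shrink small-eigenvalue directions, but the GD filter saturates to 1 exponentially fast while the ridge filter does so only polynomially. That difference in shape is one informal way to see why the two forms of regularization need not have the same risk.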

proposition 2, regression, ridge regression, (14 more...)

arXiv.org Machine Learning

2509.17251

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

A Data and Code Availability

Neural Information Processing Systems, Aug-15-2025, 15:39:30 GMT

The implementations of the experiments on the ABC and FTDC datasets are similar. For the stability analysis, we are interested in the norm of term 1. In Section E.1, we briefly discuss the motivation behind studying age prediction and PCA-based statistical analysis in this context. In Section E.2, we provide additional details on cortical thickness data acquisition. In Section E.3, we report the results for stability analysis of VNNs and PCA-regression models for FTDC100 (... In Section E.4, we study the stability of VNNs on two simulated ... In Section E.5, we include additional figures ... A promising application of brain age prediction is early detection of neurodegenerative diseases (such as Alzheimer's or Huntington's disease), which may manifest themselves as errors in age prediction in pathological contexts by machine learning models trained ... E.4 Stability of VNNs on Synthetic Data: We consider two settings for synthetic data.

covariance matrix, dataset, matrix, (14 more...)

Neural Information Processing Systems

Country: North America > United States > Pennsylvania (0.04)

Industry: Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.54)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.36)

How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Neural Information Processing Systems, May-27-2025, 18:50:09 GMT

Despite the remarkable success of transformer-based models in various real-world tasks, their underlying mechanisms remain poorly understood. Recent studies have suggested that transformers can implement gradient descent as an in-context learner for linear regression problems and have developed various theoretical analyses accordingly. However, these works mostly focus on the expressive power of transformers by designing specific parameter constructions, lacking a comprehensive understanding of their inherent working mechanisms post-training. In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of multi-heads exhibits different patterns across layers: multiple heads are utilized and essential in the first layer, while usually only a single head is sufficient for subsequent layers.

in-context learning, sparse linear regression, transformer utilize multi-head attention, (5 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Kernel Banzhaf: A Fast and Robust Estimator for Banzhaf Values

Liu, Yurong, Witter, R. Teal, Korn, Flip, Alrashed, Tarfah, Paparas, Dimitris, Freire, Juliana

arXiv.org Artificial Intelligence, Oct-10-2024

Banzhaf values offer a simple and interpretable alternative to the widely-used Shapley values. We introduce Kernel Banzhaf, a novel algorithm inspired by KernelSHAP, that leverages an elegant connection between Banzhaf values and linear regression. Through extensive experiments on feature attribution tasks, we demonstrate that Kernel Banzhaf substantially outperforms other algorithms for estimating Banzhaf values in both sample efficiency and robustness to noise. Furthermore, we prove theoretical guarantees on the algorithm's performance, establishing Kernel Banzhaf as a valuable tool for interpretable machine learning. The increasing complexity of AI models has intensified the challenges associated with model interpretability. Modern machine learning models, such as deep neural networks and complex ensemble methods, often operate as "opaque boxes." This opacity makes it difficult for users to understand and trust model predictions, especially in decision-making scenarios like healthcare, finance, and legal applications, which require rigorous justifications. Thus, there is a pressing need for reliable explainability tools to bridge the gap between complex model behaviors and human understanding. Among the various methods employed within explainable AI, game-theoretic approaches have gained prominence for quantifying the contribution of features in predictive modeling and enhancing model interpretability. While primarily associated with feature attribution (Lundberg & Lee, 2017; Karczmarz et al., 2022), these methods also contribute to broader machine learning tasks such as feature selection (Covert et al., 2020) and data valuation (Ghorbani & Zou, 2019; Wang & Jia, 2023). Such applications extend the utility of explainable AI, fostering greater trust in AI systems by providing insights beyond traditional explanations.
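The connection between Banzhaf values and linear regression mentioned in this abstract can be checked by brute force on a tiny game. The sketch below is not the Kernel Banzhaf estimator. It computes exact Banzhaf values by enumerating all coalitions, then compares them with the slopes of the uniform-weight least-squares linear fit to the value function over all coalitions. Because the membership indicators are uncorrelated under uniform weighting, each slope has a closed form. The function names and the majority game are illustrative assumptions.

```python
from itertools import combinations


def all_subsets(players):
    """Yield every coalition of the given players as a frozenset."""
    for size in range(len(players) + 1):
        for subset in combinations(players, size):
            yield frozenset(subset)


def banzhaf_exact(players, value):
    """Average marginal contribution of each player over all coalitions of the others."""
    result = {}
    for i in players:
        others = [j for j in players if j != i]
        total = sum(value(s | {i}) - value(s) for s in all_subsets(others))
        result[i] = total / 2 ** len(others)
    return result


def least_squares_slopes(players, value):
    """Slopes of the best uniform-weight linear fit of value(S) on membership
    indicators; with uncorrelated indicators, slope_i = 2 * mean(value * (+1/-1))."""
    subsets = list(all_subsets(players))
    slopes = {}
    for i in players:
        signed_sum = sum(value(s) * (1 if i in s else -1) for s in subsets)
        slopes[i] = 2 * signed_sum / len(subsets)
    return slopes


def majority(coalition):
    """Hypothetical game: a coalition wins (value 1) if it has at least two members."""
    return 1.0 if len(coalition) >= 2 else 0.0


players = ["a", "b", "c"]
exact = banzhaf_exact(players, majority)
slopes = least_squares_slopes(players, majority)
print(exact)
print(slopes)
```

Exact enumeration costs on the order of 2^n evaluations of the value function, which is why sampling-based estimators of the kind the abstract describes are needed once the number of features grows.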

artificial intelligence, banzhaf value, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2410.08336

Country:

North America > United States > New York (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.89)
Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Finite Sample Analysis of Distribution-Free Confidence Ellipsoids for Linear Regression

Szentpéteri, Szabolcs, Csáji, Balázs Csanád

arXiv.org Machine Learning, Sep-13-2024

The least squares (LS) estimate is the archetypical solution of linear regression problems. The asymptotic Gaussianity of the scaled LS error is often used to construct approximate confidence ellipsoids around the LS estimate; for finite samples, however, these ellipsoids do not come with strict guarantees unless strong assumptions are made on the noise distributions. The paper studies the distribution-free Sign-Perturbed Sums (SPS) ellipsoidal outer approximation (EOA) algorithm, which can construct non-asymptotically guaranteed confidence ellipsoids under mild assumptions, such as independent and symmetric noise terms. These ellipsoids have the same center and orientation as the classical asymptotic ellipsoids; only their radii differ, and these radii can be computed by convex optimization. Here, we establish high-probability non-asymptotic upper bounds for the sizes of SPS outer ellipsoids for linear regression problems and show that the volumes of these ellipsoids decrease at the optimal rate. Finally, the difference between our theoretical bounds and the empirical sizes of the regions is investigated experimentally.

assumption, confidence region, nr 1, (15 more...)

arXiv.org Machine Learning

2409.08801

Country:

Europe > Hungary > Budapest > Budapest (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Hawaii (0.04)
(3 more...)

Genre:

Workflow (0.67)
Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.91)
